Predicting house prices

In this lesson, we'll explore the Boston housing dataset (which is built into sklearn) and walk through the basic principles of setting up, building, tuning, and selecting a valid machine learning model.

This lesson will use sklearn in conjunction with several skutil preprocessing techniques.


In [2]:
from __future__ import print_function, division
import sklearn
import skutil
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

print('sklearn: %s' % sklearn.__version__)
print('skutil:  %s' % skutil.__version__)
print('pandas:  %s' % pd.__version__)
print('numpy:   %s' % np.__version__)


sklearn: 0.17.1
skutil:  0.0.13
pandas:  0.18.1
numpy:   1.11.0

Loading and inspecting our data

We can load our pandas dataframe and examine a sample of the data we'll be working with. At first glance, it appears every feature is numeric. This is certainly cleaner than most real-world datasets!


In [3]:
from sklearn.datasets import load_boston

boston = load_boston()
X = pd.DataFrame.from_records(data=boston.data, columns=boston.feature_names)
X.head()


Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Let's examine the first few values of our target variable. Notice the value is a real number and not a class, indicating we'll be using regression models rather than classification.


In [4]:
y = boston.target
y[:5]


Out[4]:
array([ 24. ,  21.6,  34.7,  33.4,  36.2])

By examining the dtypes (data types) attribute of the dataframe, our suspicion is confirmed: all of the features are in fact numeric. Below, we also take a look at whether there are any missing values. Luckily, in this example there are not.


In [5]:
X.dtypes


Out[5]:
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object

In [6]:
X.isnull().sum()


Out[6]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

In [7]:
X.describe()


Out[7]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000

Cleaning & preprocessing data

Rarely, if ever, will your data need no preprocessing at all. Whether it's noisy features, skewed variables, or redundant or uninformative features, your data will almost always need some massaging.

One thing to be aware of is whether your input is shuffled or ordered; if your data is ordered, know why. For example, let's build a RandomForestRegressor on the first 350 rows of X and validate the model on the remaining rows:


In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

def rmse(act, pred):
    return np.sqrt(mean_squared_error(act, pred))

# define the model
model = RandomForestRegressor(random_state=42)

# fit the model
model.fit(X[:350], y[:350])

# assess performance
print('Train R^2: %.5f'  % r2_score(y[:350], model.predict(X[:350])))
print('Train RMSE: %.5f\n' % rmse(y[:350], model.predict(X[:350])))

print('Test R^2: %.5f'  % r2_score(y[350:], model.predict(X[350:])))
print('Test RMSE: %.5f' % rmse(y[350:], model.predict(X[350:])))


Train R^2: 0.97543
Train RMSE: 1.32625

Test R^2: 0.16424
Test RMSE: 7.44935

Notice the extreme drop-off in validation performance! It's likely there are phenomena in the test data that were not observed in the training data, so the model never learned to capture those nuances.

The train_test_split

The sooner you can split your data, the better. sklearn provides a built-in mechanism for just this: sklearn.cross_validation.train_test_split. This will split your data into sets of the specified sizes and shuffle the observations at the same time.

Notice we create three splits:

  • Train—the set on which we'll develop our model
  • Validate—the set on which we assess performance of our model
  • Test—the set we'll use to select our final model. We don't touch this until the end!

In [9]:
from sklearn.cross_validation import train_test_split

tr_size = int(0.6 * X.shape[0])
va_te_size = int((X.shape[0] - tr_size) / 2)

# split the train/val and test sets
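# note: no random_state is passed to train_test_split here, so the exact split
# (and the scores printed in the following cells) will vary from run to run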
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=va_te_size) 

# split the train/val apart
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=va_te_size)

print('Train size: %i' % X_train.shape[0])
print('Validation size: %i' % X_val.shape[0])
print('Holdout size: %i' % X_test.shape[0])


Train size: 304
Validation size: 101
Holdout size: 101

In [10]:
# fit the model
model.fit(X_train, y_train)

# assess performance
print('Train R^2: %.5f'  % r2_score(y_train, model.predict(X_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, model.predict(X_train)))

print('Val R^2: %.5f'  % r2_score(y_val, model.predict(X_val)))
print('Val RMSE: %.5f' % rmse(y_val, model.predict(X_val)))


Train R^2: 0.96157
Train RMSE: 1.76292

Val R^2: 0.81943
Val RMSE: 4.00728

Notice that our validation performance is now much more similar to our training performance.

Note: It is bad practice to evaluate your model against your test set while modeling, so we use our validation set to examine incremental performance.

Transforming features

How can we make this model perform better? There may be some strange or skewed distributions within our data that we could coerce into a more normal shape. Let's take a look at just one (you could do this for all features, but for the sake of example we'll only examine the crime rate, CRIM).


In [11]:
# start by defining a very simple histogram function
def hist(x, scale=1, style='darkgrid', left=None, right=None, xlab='Count', ylab='Y'):
    # coerce to a numpy array so `x * scale` scales the values
    # (multiplying a plain list by an int would repeat it instead)
    x = np.asarray(x)
    
    figure = plt.figure()
    
    sns.set(style=style)
    sns.distplot(x * scale, hist=True, kde=False, norm_hist=True)
    
    ax = figure.get_axes()[0]
    ax.set_xlim(left=left or int(np.ceil(np.min(x))),
                right=right or int(np.ceil(np.max(x))))
    
    ax.set_xlabel(xlab)
    ax.set_ylabel(ylab)

Notice the crime feature is quite skewed. We may be able to make it more normal with a BoxCoxTransformer. We can apply this technique to other features as well, but it is not always guaranteed to work well.
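For reference, the Box-Cox transform maps each positive value x to (x**lam - 1) / lam (or to log(x) when lam is 0), choosing lam so the result looks as normal as possible. Here is a rough sketch of the raw transform using scipy, for illustration only (it is not how skutil's BoxCoxTransformer is implemented):

from scipy import stats

# estimate lambda by maximum likelihood and transform CRIM;
# Box-Cox requires strictly positive inputs, which CRIM satisfies
crim_bc, lam = stats.boxcox(X_train.CRIM.values)
print('estimated lambda: %.4f' % lam)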


In [12]:
hist(x=X_train.CRIM, ylab='Crime rate')



In [14]:
from skutil.preprocessing import BoxCoxTransformer

hist(BoxCoxTransformer(cols=['CRIM']).fit_transform(X_train).CRIM.tolist(), ylab='Crime transformed')


Feature selection

How do we know which features to retain? In this toy example we have a manageable number of features; in text analytics or computer vision, however, we often have more than 100,000. Let's explore a few techniques for reducing this high dimensionality without hurting the predictive power of our model (in no particular order):

1. Eliminate multicollinearity:


In [15]:
from skutil.feature_selection import MulticollinearityFilterer

# let's see if any features are collinear with one another:
fltr = MulticollinearityFilterer(threshold=0.9).fit(X_train)

# examine the drop attribute
fltr.drop


Out[15]:
['TAX']

The MulticollinearityFilterer searches the correlation matrix for any correlations greater than the provided threshold. When a high correlation is observed between two variables, it compares the mean absolute correlation of each feature and removes the one that is more highly correlated with the other features overall.
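As a rough illustration of that logic, here is a minimal sketch using pandas alone (not the filterer's actual implementation):

# absolute pairwise correlations, and each feature's mean absolute correlation
corr = X_train.corr().abs()
mean_abs_corr = corr.mean()

# pairs of features whose correlation exceeds the threshold
threshold = 0.9
pairs = [(a, b) for i, a in enumerate(corr.columns)
                for b in corr.columns[i + 1:]
                if corr.loc[a, b] > threshold]

# from each offending pair, drop the member with the higher mean absolute correlation
to_drop = set(a if mean_abs_corr[a] > mean_abs_corr[b] else b for a, b in pairs)
print(to_drop)  # the fitted filterer above dropped 'TAX'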

2. Eliminate features with near zero variance:


In [16]:
from skutil.feature_selection import NearZeroVarianceFilterer

# define and fit the filterer
fltr = NearZeroVarianceFilterer(threshold=1e-4).fit(X_train)

# examine the dropped cols
fltr.drop

Notice there are no features with variance less than the threshold, so the result was None. If we wanted, we could adjust that threshold.
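As a quick sanity check, you can compute the per-column variances directly with pandas (a simple sketch, separate from the filterer):

# per-column sample variances; on this data none fall below the 1e-4 threshold
variances = X_train.var()
print(variances.sort_values().head())
print((variances < 1e-4).any())  # expect False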

3. PCA (Principal Component Analysis)

(Note that this isn't actually a feature selection technique, but a dimensionality reduction technique that produces a new set of features which are linear combinations of the original input space.)


In [17]:
from skutil.decomposition import SelectivePCA

# define and fit
pca = SelectivePCA(n_components=0.85).fit(X_train)

# examine the head
pca.transform(X_train).head()


Out[17]:
PC1 PC2
0 -238.102956 110.325234
1 138.802417 -3.246269
2 89.139731 9.242997
3 111.718522 -15.849379
4 110.971250 8.055599

Notice we effectively reduced the input space to two dimensions while retaining at least 85% of the variability in the matrix!
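If you'd like to verify the retained variance yourself, here is a sketch using plain sklearn (independent of the SelectivePCA wrapper):

from sklearn.decomposition import PCA

# fit a full PCA and inspect the cumulative explained variance ratio;
# the first two components should already clear the 0.85 threshold used above
pca_check = PCA().fit(X_train)
print(np.cumsum(pca_check.explained_variance_ratio_)[:3])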

Putting it all together in a model

How can we use all of the aforementioned techniques to preprocess our data prior to modeling?


In [20]:
from skutil.preprocessing import SelectiveScaler

# multicollinearity
mcf = MulticollinearityFilterer(threshold=0.9).fit(X_train)
mcf_train = mcf.transform(X_train)

# near zero variance
nzv = NearZeroVarianceFilterer(threshold=1e-4).fit(mcf_train)
nzv_train = nzv.transform(mcf_train)

# add a step: scaling
scl = SelectiveScaler().fit(nzv_train)
scl_train = scl.transform(nzv_train)

# pca
pca = SelectivePCA(n_components=0.85).fit(scl_train)
pca_train = pca.transform(scl_train)

# fit the model
model.fit(pca_train, y_train)

# assess performance on the training set
print('Train R^2: %.5f'  % r2_score(y_train, model.predict(pca_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, model.predict(pca_train)))


Train R^2: 0.95395
Train RMSE: 1.92976

That's nice, but it's kind of a mess. What if we have a ton of preprocessors to keep track of? Things could get hairy. Furthermore, if we want to assess performance on our validation set, we have to apply every one of those intermediate transformations to it as well. What a pain!

That's what the Pipeline object is for. A Pipeline stores a sequence of named transformers, with an optional estimator as the final step. The only argument to the Pipeline constructor is a single list of (name, step) tuples:

pipe = Pipeline([
    ('name_of_first_step',  FirstTransformer()),
    ('name_of_second_step', SecondTransformer())
])

In [21]:
from sklearn.pipeline import Pipeline

# define our pipe
pipe = Pipeline([
        ('mc',  MulticollinearityFilterer(threshold=0.9)),
        ('nzv', NearZeroVarianceFilterer(threshold=1e-4)),
        ('sc',  SelectiveScaler()),
        ('pca', SelectivePCA(n_components=0.85)),
        ('rf',  RandomForestRegressor(random_state=42))
    ])

# fit our pipeline
pipe.fit(X_train, y_train)

# assess performance
print('Train R^2: %.5f'  % r2_score(y_train, pipe.predict(X_train)))
print('Train RMSE: %.5f\n'   % rmse(y_train, pipe.predict(X_train)))

print('Validation R^2: %.5f'  % r2_score(y_val, pipe.predict(X_val)))
print('Validation RMSE: %.5f\n'   % rmse(y_val, pipe.predict(X_val)))


Train R^2: 0.95395
Train RMSE: 1.92976

Validation R^2: 0.70628
Validation RMSE: 5.11078

Notice we get the exact same training results, but the code is much more elegant, with fewer intermediate variables lying around, and we can now assess performance on our validation set directly.

However, on closer inspection, we can see that our validation results are not as good as they were before preprocessing. Presumably, if we could tune our preprocessing hyperparameters alongside the model's, we could identify a better-performing configuration. Furthermore, the astute will note that we have not yet introduced any cross validation:


In [22]:
from sklearn.cross_validation import KFold

# the default sklearn cross validation does NOT shuffle, and you know how we feel about that...
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)

Now we introduce the randomized grid search: the mechanism by which we sample random combinations of hyperparameters, build a cross-validated model for each, and retain the one that performs best.


In [23]:
# make sure to use the SKUTIL grid search for DF compatibility, and not the SKLEARN one.
from skutil.grid_search import RandomizedSearchCV
from scipy.stats import uniform, randint
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# set our CV
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)

# define our pipe -- leave the hyperparameters at their defaults; the search will set them
pipe = Pipeline([
        ('mc',  MulticollinearityFilterer()),
        ('nzv', NearZeroVarianceFilterer()),
        ('sc',  SelectiveScaler()),
        ('pca', SelectivePCA()),
        ('rf',  RandomForestRegressor(random_state=42))
    ])

# let's define the hyperparameters we'll search over. Notice the form of:
# '<stage_nm>__<arg_nm>'
hyperparams = {
    'mc__threshold'        : uniform(0.95, 0.05),
    'nzv__threshold'       : [1e-4, 1e-2],
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : randint(4, X.shape[1]),
    'pca__whiten'          : [True, False],
    'rf__n_estimators'     : randint(50, 100),
    'rf__max_depth'        : randint(4, 15),
    'rf__min_samples_leaf' : randint(1, 10),
    'rf__min_samples_split': randint(2, 5),
    'rf__max_features'     : uniform(loc=.5, scale=.5),
    'rf__max_leaf_nodes'   : randint(10,50)
}

# define and fit
search = RandomizedSearchCV(pipe, 
                            hyperparams, 
                            cv=custom_cv, 
                            scoring='r2',
                            random_state=42, 
                            n_iter=30)
search.fit(X_train, y_train)

# assess performance
print('Validation R^2: %.5f'  % r2_score(y_val, search.predict(X_val)))
print('Validation RMSE: %.5f\n'   % rmse(y_val, search.predict(X_val)))


Validation R^2: 0.72060
Validation RMSE: 4.98469

We can actually view our grid results like so:


In [24]:
from skutil.utils import report_grid_score_detail
report_grid_score_detail(random_search=search, charts=True)


Out[24]:
mc__threshold nzv__threshold pca__n_components pca__whiten rf__max_depth rf__max_features rf__max_leaf_nodes rf__min_samples_leaf rf__min_samples_split rf__n_estimators sc__scaler score std
7 0.988241 0.0100 9 False 8 0.710470 45 2 4 99 RobustScaler(copy=True, with_centering=True, w... 0.791127 0.173148
22 0.987902 0.0100 8 True 11 0.637820 29 1 3 78 StandardScaler(copy=True, with_mean=True, with... 0.782222 0.171858
20 0.967801 0.0100 8 True 6 0.880463 14 2 2 80 RobustScaler(copy=True, with_centering=True, w... 0.779109 0.176249
23 0.989798 0.0100 10 True 13 0.626435 48 2 3 93 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.761346 0.185347
26 0.987078 0.0100 11 False 12 0.634674 26 4 2 88 RobustScaler(copy=True, with_centering=True, w... 0.758516 0.138921
19 0.983026 0.0100 6 False 5 0.522999 31 4 4 66 RobustScaler(copy=True, with_centering=True, w... 0.755499 0.162324
17 0.977988 0.0001 12 True 7 0.909229 28 6 2 65 StandardScaler(copy=True, with_mean=True, with... 0.750730 0.187991
8 0.979696 0.0100 6 False 4 0.552620 47 2 4 65 StandardScaler(copy=True, with_mean=True, with... 0.749867 0.178026
0 0.983156 0.0001 6 False 13 0.618107 31 5 3 64 StandardScaler(copy=True, with_mean=True, with... 0.747108 0.137729
3 0.981160 0.0001 12 True 11 0.953544 20 3 3 76 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.740208 0.220455
4 0.956413 0.0001 12 False 8 0.710228 15 6 2 60 StandardScaler(copy=True, with_mean=True, with... 0.738371 0.185873
18 0.981937 0.0001 4 False 10 0.501948 11 3 4 64 StandardScaler(copy=True, with_mean=True, with... 0.728955 0.172770
13 0.995750 0.0100 12 False 13 0.525978 14 3 4 53 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.728007 0.208564
24 0.963270 0.0100 4 True 12 0.897112 28 1 4 50 RobustScaler(copy=True, with_centering=True, w... 0.719867 0.235551
1 0.979604 0.0001 12 True 7 0.947611 12 9 4 76 StandardScaler(copy=True, with_mean=True, with... 0.712489 0.222223
10 0.968305 0.0100 9 True 7 0.554112 15 6 2 74 StandardScaler(copy=True, with_mean=True, with... 0.711891 0.140203
2 0.961227 0.0001 11 True 13 0.897544 38 8 4 53 RobustScaler(copy=True, with_centering=True, w... 0.711576 0.126452
12 0.972200 0.0001 8 False 9 0.559725 20 4 4 92 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.710379 0.199472
27 0.991030 0.0100 8 False 5 0.750453 45 9 2 96 RobustScaler(copy=True, with_centering=True, w... 0.706907 0.111591
25 0.963694 0.0100 5 False 4 0.825448 41 8 3 56 RobustScaler(copy=True, with_centering=True, w... 0.705611 0.160387
15 0.990545 0.0100 11 False 11 0.600034 45 8 4 69 StandardScaler(copy=True, with_mean=True, with... 0.703879 0.169953
9 0.954097 0.0001 9 False 12 0.662525 19 8 3 71 RobustScaler(copy=True, with_centering=True, w... 0.703771 0.150605
28 0.973833 0.0001 7 True 14 0.810080 35 9 2 85 RobustScaler(copy=True, with_centering=True, w... 0.700193 0.121987
6 0.973238 0.0100 9 True 6 0.995173 45 9 4 73 RobustScaler(copy=True, with_centering=True, w... 0.698918 0.131617
11 0.972257 0.0001 12 True 11 0.938802 48 8 4 78 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.689296 0.203342
29 0.976383 0.0100 6 False 14 0.735929 10 9 3 87 RobustScaler(copy=True, with_centering=True, w... 0.681497 0.141443
21 0.957156 0.0001 6 False 5 0.872918 10 4 3 78 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.681335 0.234420
14 0.984599 0.0100 6 True 11 0.722688 26 7 2 69 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.677875 0.184055
5 0.994759 0.0100 7 False 5 0.545275 14 9 3 62 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.639748 0.141657
16 0.979921 0.0001 5 True 4 0.683391 11 8 4 56 MinMaxScaler(copy=True, feature_range=(0, 1)) 0.506704 0.340709

By viewing these results, we can make educated decisions about refining our grid so that we don't waste time searching over parameter values that detrimentally impact performance.


In [25]:
# set our CV
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)

# define our pipe
pipe = Pipeline([
        ('mc',  MulticollinearityFilterer()),
        ('nzv', NearZeroVarianceFilterer()),
        ('sc',  SelectiveScaler(scaler=StandardScaler())), # default scaler; the search below still varies this
        ('pca', SelectivePCA()),
        ('rf',  RandomForestRegressor(random_state=42))
    ])

# we can narrow our search parameters now
hyperparams = {
    'mc__threshold'        : uniform(0.95, 0.05),
    'nzv__threshold'       : [1e-4, 1e-2],
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : randint(8, X.shape[1]),
    'pca__whiten'          : [True, False],
    'rf__n_estimators'     : randint(75, 100),
    'rf__max_depth'        : randint(4, 15),
    'rf__min_samples_leaf' : randint(1, 8),
    'rf__min_samples_split': randint(2, 5),
    'rf__max_features'     : uniform(loc=.5, scale=.5),
    'rf__max_leaf_nodes'   : randint(25,50)
}

# define and fit
search = RandomizedSearchCV(pipe, 
                            hyperparams, 
                            cv=custom_cv, 
                            scoring='r2',
                            random_state=42, 
                            n_iter=30)
search.fit(X_train, y_train)

# assess performance
print('Validation R^2: %.5f'  % r2_score(y_val, search.predict(X_val)))
print('Validation RMSE: %.5f\n'   % rmse(y_val, search.predict(X_val)))


Validation R^2: 0.73432
Validation RMSE: 4.86073

Trying a different model

We have a pretty good RandomForestRegressor at this point... why don't we try a different model?


In [26]:
from sklearn.ensemble import GradientBoostingRegressor

# define our pipe
gbm_pipe = Pipeline([
        ('mc',  MulticollinearityFilterer(threshold=0.9)),
        ('nzv', NearZeroVarianceFilterer(threshold=1e-4)),
        ('sc',  SelectiveScaler()),
        ('pca', SelectivePCA(n_components=0.85)),
        ('gbm', GradientBoostingRegressor(random_state=42))
    ])

# let's define the hyperparameters we'll search over.
gbm_hyperparams = {
    'mc__threshold'        : uniform(0.80, 0.15),
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : uniform(0.95, 0.05),
    'gbm__n_estimators'    : randint(90, 200),
    'gbm__learning_rate'   : uniform(0.075, 0.05),
    'gbm__max_depth'       : randint(2, 7), # we grow these more shallow
}

# define and fit
gbm_search = RandomizedSearchCV(gbm_pipe, 
                                gbm_hyperparams, 
                                cv=custom_cv, 
                                scoring='r2',
                                random_state=42, 
                                n_iter=30)
gbm_search.fit(X_train, y_train)

# assess performance
print('Validation R^2: %.5f'  % r2_score(y_val, gbm_search.predict(X_val)))
print('Validation RMSE: %.5f\n'   % rmse(y_val, gbm_search.predict(X_val)))


Validation R^2: 0.80500
Validation RMSE: 4.16424

Wow, that looks fantastic! However, GBMs can be quite prone to overfitting. We'll need to see how this model performs on the holdout set to determine whether it's actually a good model.

Evaluating performance on our holdout set

This happens once. When you've built a selection of candidate models, evaluate each of them exactly once against the holdout set to make your final model selection.


In [27]:
# examine RF performance
print('RF test R^2: %.5f'    % r2_score(y_test, search.predict(X_test)))
print('RF test RMSE: %.5f\n' % rmse(y_test,     search.predict(X_test)))

# examine GBM performance
print('GBM test R^2: %.5f'  % r2_score(y_test, gbm_search.predict(X_test)))
print('GBM test RMSE: %.5f' % rmse(y_test,     gbm_search.predict(X_test)))


RF test R^2: 0.69413
RF test RMSE: 5.18086

GBM test R^2: 0.72310
GBM test RMSE: 4.92935

For the RF, the validation error closely resembles the holdout error, which suggests we are not overfitting. The GBM, however, shows a larger drop from validation to holdout. We could tune its hyperparameters to address this, but for the sake of example we won't in this demo.

See the next demo for information on how to persist models to disk.
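As a quick preview, here is a minimal sketch using sklearn's bundled joblib (the file name is arbitrary, the fitted search object is assumed to pickle cleanly, and the next demo covers this properly):

from sklearn.externals import joblib

# persist the fitted search to disk and load it back later
joblib.dump(gbm_search, 'gbm_search.pkl')
restored = joblib.load('gbm_search.pkl')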


In [ ]: